Feature Selection, Model Selection, and Tuning (FMST) - Employee Promotion Eligibility Prediction at Likoma Company - By David Salako.

Background and Context

Employee promotion means the ascension of an employee to a higher rank. This aspect of a job is one of the strongest motivators for employees: it is the ultimate reward for dedication and loyalty towards an organization, and the HR team plays an important role in handling these promotion decisions based on ratings and other available attributes.

The HR team at Likoma Company stored data from last year's promotion cycle, consisting of details of all the employees working in the company that year and whether or not they were promoted. However, the process is delayed every cycle because so many details are available for each employee, which makes it difficult to compare candidates and decide.

So this time the HR team wants to utilize the stored data to build a model that will predict whether a person is eligible for promotion.

Problem Statement:

As a data scientist at Likoma Company, I need to design a model that will help the HR team predict whether an employee is eligible for promotion.

Objective:

To explore and visualize the data and build a model that identifies the employees who have a higher probability of being promoted; subsequently, to optimize the classification model using appropriate techniques; and finally, to generate a set of insights and recommendations that will help the company and its human resources department.

Data Dictionary & Description:

Importing the necessary libraries

Import Dataset

View the first and last 50 rows of the dataset.

The shape of the dataset.

The dataset has 54,808 rows and 13 columns.

Change the column names to uppercase format so that they are easier to read and identify.
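The renaming step above can be sketched as follows, assuming the dataset has been loaded into a pandas DataFrame (the two columns here are illustrative):

```python
import pandas as pd

# Illustrative stand-in for the loaded dataset
df = pd.DataFrame({"employee_id": [1, 2], "avg_training_score": [70, 85]})

# Rename all columns to uppercase so they are easier to read and identify
df.columns = df.columns.str.upper()

print(df.columns.tolist())  # ['EMPLOYEE_ID', 'AVG_TRAINING_SCORE']
```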

Drop the original columns as they are currently duplicated.

Check the data types of the columns of the dataset.

Convert the REGION variable to a categorical data type.

Check the data type of the REGION column of the dataset to confirm the conversion has taken place.
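A minimal sketch of the dtype conversion and check, using illustrative region values:

```python
import pandas as pd

df = pd.DataFrame({"REGION": ["region_1", "region_2", "region_1"]})

# Convert REGION from object (string) to the memory-efficient category dtype
df["REGION"] = df["REGION"].astype("category")

print(df["REGION"].dtype)  # category
```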

Confirm that the changes in the EDUCATION, GENDER, RECRUITMENT_CHANNEL, and DEPARTMENT variables have been made.

Reconfirm the counts of NULL records that currently exist in the dataset.

Continuous columns

Categorical columns

Number of observations in each category.

Exploratory Data Analysis (EDA).

Univariate Analysis.

Boxplot and Histogram plots for the important independent numeric variables.
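The combined boxplot-plus-histogram view can be sketched as below; the helper name `hist_box` and the synthetic AGE data are assumptions for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"AGE": rng.integers(20, 60, 500)})

def hist_box(data, col):
    """Boxplot on top, histogram below, sharing the same x-axis."""
    fig, (ax_box, ax_hist) = plt.subplots(
        nrows=2, sharex=True, gridspec_kw={"height_ratios": (0.25, 0.75)}
    )
    ax_box.boxplot(data[col], vert=False)
    ax_hist.hist(data[col], bins=20)
    ax_hist.set_xlabel(col)
    return fig

fig = hist_box(df, "AGE")
```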

Observations on AGE.

Observations on AVG_TRAINING_SCORE.

Observations on PREVIOUS_YEAR_RATING.

Let us define a function to create barplots for the categorical variables indicating the percentage of each category for each of the variables.
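A minimal sketch of such a function (the name `perc_barplot` and the sample GENDER values are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"GENDER": ["m", "m", "f", "m"]})

def perc_barplot(data, col):
    """Bar plot showing the percentage share of each category in a column."""
    perc = data[col].value_counts(normalize=True) * 100
    ax = perc.plot(kind="bar")
    ax.set_ylabel("Percentage")
    return perc

perc = perc_barplot(df, "GENDER")
print(perc.to_dict())  # {'m': 75.0, 'f': 25.0}
```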

Bivariate Analysis

For ease of analytical computation, I will temporarily convert all the string values in the categorical variables to numeric values, and then convert their data types to int64.

Correlation matrix for the numeric independent variables.
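Computing the correlation matrix is a one-liner in pandas; the three columns below are a synthetic stand-in for the dataset's numeric variables:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "AGE": rng.integers(20, 60, 200),
    "LENGTH_OF_SERVICE": rng.integers(1, 30, 200),
    "AVG_TRAINING_SCORE": rng.integers(40, 100, 200),
})

# Pearson correlation between all pairs of numeric columns
corr = df.corr()
print(corr.shape)  # (3, 3) - symmetric, with 1.0 on the diagonal
```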

Pairplot

The pairplot visualization displays similar results and observations to the earlier correlation matrix.

Reverse the earlier updates to the categorical variables back to strings rather than numbers. This is in preparation for the one-hot encoding transformation coming later.

Define a function to plot stacked bar charts
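The stacked-bar helper can be sketched with `pd.crosstab`; the function name and the tiny sample data are assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "DEPARTMENT": ["Sales", "Sales", "HR", "HR"],
    "IS_PROMOTED": [1, 0, 0, 0],
})

def stacked_barplot(data, predictor, target):
    """Normalized stacked bar chart: share of each target class per predictor level."""
    tab = pd.crosstab(data[predictor], data[target], normalize="index")
    tab.plot(kind="bar", stacked=True)
    return tab

tab = stacked_barplot(df, "DEPARTMENT", "IS_PROMOTED")
```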

IS_PROMOTED vs DEPARTMENT

IS_PROMOTED vs REGION

IS_PROMOTED vs EDUCATION

IS_PROMOTED vs GENDER

IS_PROMOTED vs RECRUITMENT_CHANNEL

IS_PROMOTED vs NO_OF_TRAININGS

IS_PROMOTED vs AGE

IS_PROMOTED vs PREVIOUS_YEAR_RATING

IS_PROMOTED vs AWARDS_WON

IS_PROMOTED vs AVG_TRAINING_SCORE

IS_PROMOTED vs LENGTH_OF_SERVICE

IS_PROMOTED vs AGE

IS_PROMOTED vs AVG_TRAINING_SCORE

IS_PROMOTED vs LENGTH_OF_SERVICE

IS_PROMOTED vs REGION

IS_PROMOTED vs GENDER

IS_PROMOTED vs DEPARTMENT

Percentage of outliers in each column of the data, using the interquartile range (IQR).
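The IQR-based outlier percentage can be sketched as below; the helper name and the synthetic column (mostly small values plus two extremes) are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame(
    {"LENGTH_OF_SERVICE": np.append(rng.integers(1, 10, 98), [40, 45])}
)

def outlier_percentage(data):
    """Percentage of rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per column."""
    q1, q3 = data.quantile(0.25), data.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mask = (data < lower) | (data > upper)
    return mask.mean() * 100

pct = outlier_percentage(df)
print(pct)
```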

Removing the outliers

Data Preparation

Drop the EMPLOYEE_ID variable as it does not add any value to the model that is being built to predict promotion recommendations for employees.

Split the data into train and test sets
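A minimal sketch of the split with scikit-learn, on synthetic stand-in data; stratifying on the target keeps the promotion rate consistent across the splits:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = pd.DataFrame({"AGE": rng.integers(20, 60, 100)})
y = pd.Series(rng.integers(0, 2, 100), name="IS_PROMOTED")

# Stratify on the target so both splits keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)  # (70, 1) (30, 1)
```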

Missing Values

PREVIOUS_YEAR_RATING also has 4124 missing records that are currently represented by zeros.

Missing Value Treatment With K-Nearest Neighbors (KNN) Imputation Method

I will pass numerical values for each categorical column for the KNN imputation operation via label encoding.

Checking inverse mapped values/categories.
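The label-encode, KNN-impute, inverse-map round trip described above can be sketched as follows; the two columns and `n_neighbors=2` are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "EDUCATION": ["Bachelor's", "Master's", None, "Bachelor's"],
    "PREVIOUS_YEAR_RATING": [3.0, np.nan, 5.0, 4.0],
})

# Label-encode the categorical column so KNNImputer can operate on numbers,
# keeping the category mapping so values can be inverse-mapped afterwards
codes, categories = pd.factorize(df["EDUCATION"])
df["EDUCATION"] = np.where(codes == -1, np.nan, codes)  # -1 marks missing

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Inverse-map the (rounded) numeric codes back to the original categories
imputed["EDUCATION"] = imputed["EDUCATION"].round().astype(int).map(
    dict(enumerate(categories))
)
print(imputed.isna().sum().sum())  # 0 - no missing values remain
```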

End of Missing Values Imputation

Encoding categorical variables
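One-hot encoding can be sketched with `pd.get_dummies`; the sample columns are assumptions, and `drop_first` avoids the dummy-variable trap:

```python
import pandas as pd

df = pd.DataFrame({"GENDER": ["m", "f", "m"], "AGE": [30, 40, 35]})

# One-hot encode the categorical column; drop_first removes one redundant level
encoded = pd.get_dummies(df, columns=["GENDER"], drop_first=True)
print(encoded.columns.tolist())  # ['AGE', 'GENDER_m']
```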

Building the Model(s)

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting an employee will be recommended for promotion and the employee is not recommended for promotion.
  2. Predicting an employee will not be recommended for promotion and the employee is recommended for promotion.

Which case is more important?

How can this loss be reduced, i.e., how do we reduce False Negatives?

I will create two functions to calculate different metrics as well as a confusion matrix. This will reduce redundancy so that the same code is not used repeatedly for each model.
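A minimal sketch of the two helpers (the function names and the toy labels are assumptions):

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score,
)

def model_scores(y_true, y_pred):
    """Return the four headline classification metrics in one dictionary."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    }

def conf_matrix(y_true, y_pred):
    """Confusion matrix as a 2x2 array: [[TN, FP], [FN, TP]]."""
    return confusion_matrix(y_true, y_pred)

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])
print(model_scores(y_true, y_pred)["Recall"])  # 2 of 3 positives found -> ~0.667
```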

Model Building

Performance comparison

Hyperparameter Tuning
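The tuning steps below all follow the same scikit-learn pattern; here is a minimal sketch using `GridSearchCV` with a `GradientBoostingClassifier` stand-in on synthetic data, with an illustrative (assumed) parameter grid. The same pattern applies to XGBoost and AdaBoost with their own grids.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Illustrative grid - real grids would be wider and model-specific
param_grid = {"n_estimators": [50, 100], "learning_rate": [0.1, 0.2]}

# Score on recall, since false negatives are the costly error here
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid, scoring="recall", cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```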

Tuning XGBoost

Tuning AdaBoost

Tuning Gradient Boosting classifier

Oversample the train data
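Oversampling is often done with a library such as imbalanced-learn (e.g. SMOTE); a minimal self-contained sketch of simple random oversampling with `sklearn.utils.resample`, on assumed toy data:

```python
import pandas as pd
from sklearn.utils import resample

train = pd.DataFrame({"AGE": range(10), "IS_PROMOTED": [0] * 8 + [1] * 2})

minority = train[train["IS_PROMOTED"] == 1]
majority = train[train["IS_PROMOTED"] == 0]

# Randomly duplicate minority rows until the classes are balanced
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=1
)
balanced = pd.concat([majority, minority_up])
print(balanced["IS_PROMOTED"].value_counts())  # both classes now have 8 rows
```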

Fitting the tuned models on oversampled data

XGBoost

XGBoost Oversampled: Recall has improved substantially on the validation set, from 0.589 to 0.850, but the model overfits on the training data.

AdaBoost

AdaBoost Oversampled: Recall overfits in training, and the validation Recall is low at 0.327.

Gradient Boosting

Gradient Boosting Oversampled: Recall has dropped from 0.813 to 0.757 on the validation set.

Undersample the train data
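Undersampling follows the same pattern in reverse; a minimal random-undersampling sketch with `sklearn.utils.resample` on assumed toy data (imbalanced-learn's RandomUnderSampler is a common alternative):

```python
import pandas as pd
from sklearn.utils import resample

train = pd.DataFrame({"AGE": range(10), "IS_PROMOTED": [0] * 8 + [1] * 2})

minority = train[train["IS_PROMOTED"] == 1]
majority = train[train["IS_PROMOTED"] == 0]

# Randomly drop majority rows down to the minority-class size
majority_down = resample(
    majority, replace=False, n_samples=len(minority), random_state=1
)
balanced = pd.concat([majority_down, minority])
print(balanced["IS_PROMOTED"].value_counts())  # both classes now have 2 rows
```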

Fitting the tuned models on undersampled data

XGBoost

XGBoost Undersampled: Recall overfits in training and reaches 0.983 in validation, both at the expense of Accuracy, Precision, and F1 Score.

AdaBoost

AdaBoost Undersampled: in training, all the scores hover around the 0.7 level; in validation, all the scores except Accuracy drop.

Gradient Boosting

Gradient Boosting Undersampled: low Recall scores in both training and validation.

Model Performance comparison

Feature Selection

Using Random Forest to select features based on feature importance. Feature importance is calculated from the node impurities in each decision tree; in a Random Forest, the final feature importance is the average of the importances across all decision trees.

I will also use LightGBM.
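The Random Forest importance calculation described above can be sketched as follows, on synthetic stand-in data; scikit-learn exposes the averaged impurity-based importances directly via `feature_importances_`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])

rf = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances, averaged over all trees in the forest; they sum to 1
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```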

Building the final model with Column Transformer

First let us do some basic pre-processing.

The best model to proceed with (in this scenario, XGBoost) is already known, so three splits (training, validation, and test) are no longer required.
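A minimal sketch of the ColumnTransformer pipeline, on assumed toy data; the notebook's final model is XGBoost, but scikit-learn's GradientBoostingClassifier stands in here so the sketch is self-contained:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "AGE": [30, 40, 35, 50, 28, 41],
    "DEPARTMENT": ["Sales", "HR", "Sales", "HR", "Sales", "HR"],
    "IS_PROMOTED": [1, 0, 0, 1, 0, 1],
})
X, y = df.drop(columns="IS_PROMOTED"), df["IS_PROMOTED"]

# Route numeric columns through imputation and categoricals through one-hot encoding
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["AGE"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["DEPARTMENT"]),
])

# Chaining preprocessing and classifier keeps the whole model in one object
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", GradientBoostingClassifier(random_state=0)),
])
model.fit(X, y)
preds = model.predict(X)
print(preds)
```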

Business Recommendations